INTRODUCTION

As a complex, yet inevitable aspect of all nations, poverty has profound consequences on the well-being and development of individuals, families, and entire communities, creating a cycle of disadvantage and inequality. Understanding and addressing the root causes of poverty are crucial steps towards creating a more equitable and sustainable world.

To obtain our data set, we worked as research assistants under the Social Innovation and Entrepreneurship (SiE) Lab. Through its expansive network of research faculty, data analysts, and subject matter experts, the SiE Lab provides consulting services to nonprofits and social enterprises. Consulting and technical assistance are available to these partners for assessment needs, evaluation, strategic planning, and capacity building. Specifically, our data was collected through an initiative of the SiE Lab called the Community Aspirations Hub. This hub uses a tool called Aspire, a technology-based self-assessment that identifies strengths and needs related to wellbeing across multiple dimensions. Aspire uses a multidimensional framework measuring 55 indicators of wellbeing across 6 dimensions; these dimensions closely align with the social determinants of health. Participants can then use Aspire to identify priorities for growth and build them into a personalized plan for change. Additionally, participants are paired with an Aspire coach who helps them achieve their priorities.

Because we were working with real, uncleaned data, we undertook an extensive data cleaning process before proceeding with our analysis. This involved identifying and addressing missing values, handling outliers, and validating the data for consistency. We also carefully checked for potential errors or inconsistencies in the data collection process. The data cleaning process plays a crucial role in the reliability of our research findings and supports the validity of our conclusions. Once the data set was cleaned and we had conducted background research, we arrived at two “big idea” questions.

In data science, one may ask whether it is possible to build a predictive model that fits the data. This brings us to our first question: Which grouping of indicators is most predictive of annual income? We believe this question is interesting and could be explored further to study which indicators matter more to annual income than others. For instance, many participants may receive results from Aspire in which multiple indicators fall in the medium or low category. If this is the case, it falls to the participant and coach to decide which indicators should be worked on first, or at all. Ultimately, this is much too subjective; the methods we perform below could instead be used to determine which indicators coaches and their participants should prioritize. Considering the variability across locations, we emphasize that the methods performed below can be replicated, but the results should not be assumed to be the same when applied to different locations.

Our second question is: Are certain races negatively impacted more significantly than others by living conditions, when impact is represented as a measure of annual income? We find this question intriguing as it could be leveraged by international Poverty Stoplight partners to study which race groups are impacted by living conditions and how that impact differs between living situations. We chose this question because of the bias inherent in measuring poverty. Individuals born into poverty face outside factors that may cause the situation they are in or prevent them from escaping it; this is the cycle of poverty. Often, without a helping hand to intervene, this cycle never ends. One example of this cycle is evident in the living situations to which individuals are subjected. We would ask the reader a rhetorical question: Who is more likely to succeed, a white man born into a household with three bedrooms (one for him, one for his parents, and one for his single sibling), or a black woman born into a household with a single bedroom for her, her parents, and her seven siblings? Though the answer may be intuitive, we will delve deeper through analysis to first confirm this observation and then measure the impact.

DATA

In general, Aspire data is geocoded and can be analyzed at the neighborhood and community level. The SiE Lab collected the data used in our data set through their partner Telamon, a nonprofit dedicated to disrupting the cycle of poverty by empowering families to overcome barriers to success. The data set we chose to analyze is hyperlocal, with observations defined as individuals located in Cabarrus County with at least one child aged five or younger. Because of this, conclusions drawn from this data may at first appear to hold for the population as a whole, but they cannot be generalized and should be regarded only as representative of the sample we analyzed. However, because our data is still drawn from a diverse subset of individuals with differences in race, gender, and income, it remains interesting and significant as an analysis of individuals in Cabarrus County. Moreover, our methods could be replicated by others looking to perform hyperlocal analysis of other subsets of the overall population.

While our data may have been hyperlocal, the project is not. As a whole, the Poverty Stoplight project is an international mission to activate the potential of families to discover practical and innovative solutions to improve their lives in all aspects. In fact, the project is currently being used in over forty-seven countries by fifteen hubs. There have been over four hundred organizational partners on the project, with over two hundred and twenty thousand stoplights created worldwide. Our goal in analyzing Cabarrus County is to inspire others to use our methods, and to create their own adaptations of them, to perform critical analysis of their own communities.

To answer our questions, we looked at four primary variables in addition to the 55 indicators of wellbeing. Because our data was measured predominantly on families in poverty, one of the most important variables was “Annual Income”. There were 181 observations of “Annual Income” in our data, and most families reported an income between $0 and $30,000. Another variable that we found important was “Race”. Out of 596 observations for the “Race” variable, our demographics were the following: 19% White, 29% Black or African-American, 31% Hispanic, and 17% Other, with just a few observations belonging to Native Hawaiian or Other Pacific Islander, American Indian or Alaska Native, and Asian. It is important to note, however, that there is a strong relationship between “Other” and “Hispanic”, as most families who chose “Other” also chose “Hispanic”. This suggests that the majority of our data is on Hispanic families, yet because many of them are recorded under “Other”, Hispanic families are underrepresented in the “Race” variable when performing our analysis. Unfortunately, there was no way to solve this issue, and it is something we recommend changing in how families fill out the survey. “Housing Situation” was another variable we wanted to look closely at. There were seven different responses, including “Single Family House”, “Living With Family”, “Other”, “Apartment/Condominium”, “Group Home”, “Trailer/Mobile Home”, and “Transitional Housing”, across 404 observations. However, looking at the graphic, most of the observations fall within just three responses. In relation to this variable, we also looked into whether the family rented or owned their home, in hopes of finding a disparity between the two.
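
The race percentages quoted above can be reproduced from the counts reported in the Race table of this section. The following is a minimal sketch in Python; the variable names are our own, and only the counts come from our data.

```python
# Race counts as reported in the Race table of this paper.
race_counts = {
    "White": 111,
    "Black or African American": 175,
    "Asian": 1,
    "Hispanic": 182,
    "American Indian or Alaska Native": 16,
    "Native Hawaiian or Other Pacific Islander": 7,
    "Other": 104,
}

# Total number of "Race" observations.
total = sum(race_counts.values())

# Share of each race as a percentage of all "Race" observations.
shares = {race: 100 * count / total for race, count in race_counts.items()}

for race, share in shares.items():
    print(f"{race}: {share:.1f}%")
```

Rounded to whole percentages, the four largest groups match the figures quoted in the text.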

Lastly, the most important aspect of our data was the 55 indicators of personal well-being. These are categorical variables with responses “1”, “2”, and “3”. If the respondent answered “1”, that means they do not feel that they are well-off in the given indicator. For example, if the indicator is “Food access” and the respondent answers with “1”, then they do not feel they have sufficient access to food. Alternatively, if the respondent answers with “3”, then they feel comfortable with their situation for the given indicator. Very few respondents reported “0”, which indicates that they elected not to answer the question. We changed any “0” responses to “NA”.
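
The recoding of skipped answers can be sketched as follows. This is an illustration only, not our actual cleaning script: the function name and the sample responses are made up, and Python's None stands in for “NA”.

```python
from typing import List, Optional

def recode_skipped(responses: List[int]) -> List[Optional[int]]:
    """Replace the skip code 0 with None (our stand-in for NA); keep 1, 2, 3."""
    return [None if r == 0 else r for r in responses]

# Made-up responses for a single indicator such as "Food access".
food_access = [1, 3, 0, 2, 3]
cleaned = recode_skipped(food_access)

# Later summaries should use only the answered questions.
answered = [r for r in cleaned if r is not None]
```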

Income Categories                            Count
< $5k                                        9
< $10k                                       20
< $15k                                       27
< $20k                                       36
< $25k                                       36
< $30k                                       23
≥ $30k                                       30

Race                                         Count
White                                        111
Black or African American                    175
Asian                                        1
Hispanic                                     182
American Indian or Alaska Native             16
Native Hawaiian or Other Pacific Islander    7
Other                                        104

Household Type                               Count
Single Family House                          191
Living With Family                           1
Other                                        1
Apartment/Condominium                        63
Group Home                                   3
Trailer/Mobile Home                          141
Transitional Housing                         4

RESULTS

Question 1: Which grouping of indicators is most predictive of annual income?

To address this question, we first examine the distribution of annual income and the 55 indicators of wellbeing, and the relationships between them. These indicators encompass aspects such as food access, health services, environment, savings, and more. They are self-evaluated and ranked from 1 to 3, with 0 indicating that the respondent skipped the question. To ensure data integrity, we clean the data by replacing all 0 values with NAs. Subsequently, we calculate the root mean square error (RMSE) of each indicator after removing the NAs, as displayed in the table provided.

Additionally, we group the indicators according to the six dimensions presented by the SiE Lab: Income & Employment, Health & Environment, Housing & Infrastructure, Education & Culture, Organization & Participation, and Interiority & Motivation. For each group, we compute the mean RMSE to gain insight into their predictive capabilities. Notably, we identify the top 6 indicators overall, namely bank services, close relationship, vaccines, phone, bathroom, and healthy vision, all exhibiting RMSE values very close to zero. Furthermore, we extract the best indicators from each group, which are generating income, vaccines, phone, bank services, agency, and close relationships.

This research paper investigates the effectiveness of indicator-based models in predicting annual income within Cabarrus County. Five different models were chosen to determine the significance of various regressors in the prediction process. These models included: (1) “All 55 Indicators,” (2) “Empty Model,” (3) “Top 6 Indicators Overall,” (4) “Best Indicators from Each Group,” and (5) “Best Group of Indicators.” Leave-one-out cross-validation (LOOCV) was employed on 244 observations with reported income to evaluate the model performance using the root mean square error (RMSE).
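
To illustrate the evaluation procedure, the following minimal Python sketch compares the LOOCV RMSE of a one-indicator model against the empty model. It rests on assumptions that are not taken from our analysis: the model is an ordinary least-squares line with a single indicator as the only regressor, the function names are our own, and the indicator and income values are made up.

```python
import math

def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of a least-squares line; slope is 0 if x is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return mean_y, 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def loocv_rmse(xs, ys, use_predictor=True):
    """Leave-one-out RMSE. With use_predictor=False this is the empty (mean-only) model."""
    squared_errors = []
    for i in range(len(ys)):
        # Hold out observation i and fit on the remaining ones.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if use_predictor:
            intercept, slope = fit_simple_ols(train_x, train_y)
            prediction = intercept + slope * xs[i]
        else:
            prediction = sum(train_y) / len(train_y)
        squared_errors.append((ys[i] - prediction) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up indicator scores (1 to 3) and annual incomes, for illustration only.
indicator = [1, 2, 3, 1, 2, 3, 2, 1]
income = [8000, 15000, 22000, 12000, 18000, 30000, 9000, 14000]

rmse_indicator = loocv_rmse(indicator, income)
rmse_empty = loocv_rmse(indicator, income, use_predictor=False)
print(f"indicator model: {rmse_indicator:.2f}, empty model: {rmse_empty:.2f}")
```

A model is only worth keeping if its LOOCV RMSE falls below that of the empty model, which is the benchmark we use when comparing the five models.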

Our findings reveal that none of the selected models provided accurate predictions of annual income. The average RMSE values of all models were higher than that of the empty model, suggesting that the use of these models was less effective than simply relying on average annual income for predictions. We attribute this limitation to the narrow range and low variance of indicator values, as well as the subjective nature of data collected from families. Moreover, certain observations with counterintuitive income-indicator relationships further complicated the prediction task. Based on our analysis, we propose that the predictive power of these indicators could be enhanced by providing survey recipients with a wider range of values and categorical reference points, which may reduce bias and lead to more representative data.

RMSE Summary

Model                             RMSE
Lowest RMSE from Each Group       26057.21
Empty Regression                  25273.08
Group with Lowest Average RMSE    32606.51
All Indicators                    67976.63
6 Lowest RMSE                     73202.81

Note: All RMSE values indicated that the models were not good fits for the data.

Question 2: Are certain races negatively impacted more significantly than others by living conditions – when measured by annual income?

To begin our analysis, we compared the housing situations of different racial groups using both bar plots and pie charts. The bar plot presents the distribution of house types within each racial group based on counts, while the pie chart illustrates the proportions of each house type within the group. Among the racial groups “Black or African American” and “Other,” we observed similar results. The most popular house type for both groups was the single-family house, which accounted for half of the population. The second most prevalent house type was the apartment/condominium, making up around one-third of the group. For the racial groups “White” and “Hispanic,” we noticed a similar distribution pattern. Both groups showed single-family houses and trailer/mobile homes as the most popular categories, each occupying approximately half of the group. “Asian” and “American Indian or Alaska Native” racial groups also demonstrated similar distributions. In both cases, their observations were fewer than five, with the only house type being the single-family house.

Next, we delved into the rent/own situation within each racial group using bar plots and pie charts. For the racial group “Black or African American,” approximately 75% of the group rented their place, while the rest either owned their homes or lived with family or partners. Similarly, for the “Hispanic” group, around 65% rented their place, with the remaining owning their homes or living with family or partners. Both the “White” and “Other” racial groups demonstrated a similar distribution. Roughly 50% of these groups rented their houses, while 35% owned their homes, and the remainder lived with family or partners. The “Asian” and “American Indian or Alaska Native” racial groups had limited observations, with both reporting fewer than five instances, and the only rent/own situation they reported was “rent.”

The graph below depicts the average annual income by housing situation, separated by race. Each housing type also has a 95% confidence interval, to aid in estimating where the true average for that type could fall. Notice, though, that not all household types have confidence intervals, and some household types do not even have observations for average annual income. This is because there are too few observations for that group, so it is impossible to obtain these intervals. Looking at the data we do have, however, we see that income varies drastically across both housing type and race, which suggests that income depends jointly on both variables. The results indicate that Black or African-American families have a much higher average income than any other race in single-family houses and apartments, and families who claimed “Other” for race seem to have the lowest annual income across all housing types.
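
The group averages and intervals described above can be sketched as follows. This is a hedged illustration, not our plotting code: it assumes a normal-approximation interval (a t-based interval would be wider for small groups), the function name is our own, and the incomes are made up. A group with a single observation gets a mean but no interval, which mirrors the missing intervals in the graph.

```python
import math
import statistics

# Two-sided 95% critical value of the standard normal distribution (about 1.96).
Z_95 = statistics.NormalDist().inv_cdf(0.975)

def mean_with_ci(values):
    """Return (mean, lower, upper); the bounds are None with fewer than 2 values."""
    mean = statistics.mean(values)
    if len(values) < 2:
        return mean, None, None
    half_width = Z_95 * statistics.stdev(values) / math.sqrt(len(values))
    return mean, mean - half_width, mean + half_width

# Made-up incomes keyed by (race, housing type), for illustration only.
incomes_by_group = {
    ("White", "Single Family House"): [12000, 18000, 25000, 21000],
    ("White", "Group Home"): [9000],
}

summary = {group: mean_with_ci(vals) for group, vals in incomes_by_group.items()}
```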

The graph below also shows average annual income by housing situation, separated by race. The difference, however, is that race is partitioned as “Race Status”. “Race Status” is a variable our group implemented to categorize families into: “Only White”, “Only Black or African-American”, “Other Single-Race Minority”, and “Mixed Race”. This graph is salient because it builds upon the previous graph to examine the relationship between minority status, including the levels of this status, and the impact on “Annual Income”. For example, if a family was both Black and White, they were counted twice in the previous graph: once in the White race and once in the Black race. In this graph, the family is included in “Mixed Race”, so there is no double-counting. This means the number of observations for each group decreases, but the results reflect our data more accurately in terms of race. It is also important to mention that the “Other Single-Race Minority” category has a very low number of observations. We believe this is because families who said they were Hispanic also checked off another race. This means all Hispanic families fall into the “Mixed Race” category, along with families who are truly mixed-race. The Poverty Stoplight organization could solve this problem by removing the separate “Hispanic” variable when logging data on families and adding “Hispanic” as an option under the “Race” variable. Hispanic families seem to be confused when responding to the question regarding race, so this change would remove the confusion and allow Hispanic families to be represented properly. Looking at the graph, the results are similar to those of the previous graph. Black families have a much higher average income relative to other race types. The “Mixed Race” category has plentiful and consistent data, but its income levels appear lower than those of single-race families.

In both graphs, we cannot come to any conclusions about other single-race minorities such as American Indian or Native Hawaiian. There is simply not enough data to support any claims.
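
The “Race Status” grouping can be sketched as a small Python function. This is an illustration under assumptions, not our actual code: the function name is our own, each family's answers are represented as a set of selected options, and a “Hispanic” selection counts as one of those options, which is why Hispanic families who also checked another race land in “Mixed Race”.

```python
from typing import Optional, Set

def race_status(selected: Set[str]) -> Optional[str]:
    """Map a family's selected race options to one Race Status category."""
    if not selected:
        return None  # no race reported
    if len(selected) > 1:
        return "Mixed Race"  # counted once, instead of once per selected race
    if selected == {"White"}:
        return "Only White"
    if selected == {"Black or African American"}:
        return "Only Black or African-American"
    return "Other Single-Race Minority"

# A family that selected both Black and White is counted once.
print(race_status({"White", "Black or African American"}))
```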

CONCLUSION

We found that the indicators meant to measure poverty could not be used to predict annual income, our proxy for poverty. In fact, we fit several models using different subsets of the best-performing indicators, and an empty model containing no indicators predicted annual income better than any model built from the self-evaluated survey indicators.

Although our results were not as strong as we may have initially hoped, our methods of analyzing the data could be used internationally by any of the more than four hundred Poverty Stoplight partners to analyze their own hyperlocal location. Their results would represent the indicators most predictive of annual income, and the races most negatively impacted by living conditions, within that location. Because the population size of each race and the degree of racial prejudice tend to differ in different parts of the world, we would expect these results to differ significantly from our own.

Another area for improvement, both within our own data analysis and in future analyses using these methods, would be to collect a much larger sample. A pitfall of our data shortage can be seen particularly within the Asian subset where, after cleaning the data, we had no observations that also had an observed annual income. This entire subset of the population therefore had no findings in the end result, which weakened the conclusion. This could have biased our results if the indicators affect the Asian population differently than other races; however, so long as our results are not generalized to the Asian population of Cabarrus County, the conclusions hold for others. There is also potential for k-nearest neighbors (KNN) or a similar machine learning method to find a model that better fits the data, though we would caution anyone performing this analysis to avoid overfitting the model to the data.